Abstract

Design, Analysis and Reporting Considerations When
ANCOVA-type Techniques Are Used in Evaluation Settings

James L. DiCostanzo          R. Tony Eichelberger

Learning Research and Development Center
University of Pittsburgh

Educational researchers often utilize ANCOVA-type techniques 
to assess the effects of innovative programs implemented in naturalistic
settings. This paper delineates and describes analysis and reporting
considerations for the application of ANCOVA-type techniques in public
school settings, based primarily on a review and critique of the national
Follow Through evaluation. The general areas that are discussed include:

     1.  Relating the specific research hypotheses, the results
         of the corresponding ANCOVA data analyses, and the
         general evaluation question(s).

     2.  Defining the variables, describing the rationale and
         procedure for selecting the measures, and specifying
         the relationships among the measures, variables,
         and evaluation questions.

     3.  Stating the criteria utilized to decide whether a specific
         ANCOVA analysis should be made and interpreted.

     4.  Delineating the criteria utilized to determine whether
         a covariate is to be included in a specific analysis.

     5.  Describing in detail the different groups included in
         the ANCOVA analyses, and the groups' educational
         experiences.



Each of these five general topics contains numerous suggestions
designed to assist the reader of reports utilizing ANCOVA-type techniques
to: (a) assess the appropriateness of the technique applied, (b) examine
possible alternative interpretations of the results, and (c) place the
author's conclusions in a more accurate perspective. [...]
is needed by researchers to indicate the limited nature [...]
as well as its strengths, so that a more balanced [...]
and its effects is presented to a decision maker.

Design, Analysis and Reporting Considerations When
ANCOVA-type Techniques Are Used in Evaluation Settings

Since the early 1960's, the federal government has authorized
and funded numerous social action programs, many of which focused on
compensatory education. The evaluations of these programs have usually
been attempts to implement an experimental paradigm designed to maximize
internal validity. Since manipulation of important variables is rarely,
if ever, possible (and often not appropriate) in evaluation settings, some
type of analysis of covariance (ANCOVA) technique is frequently utilized
to compensate statistically for the lack of experimental control.

Use and interpretation of the ANCOVA technique is extremely
complex, requiring that numerous assumptions and conditions be met
if meaningful interpretations are to be applied to educational settings.
These assumptions are never precisely met in an evaluation setting, so
the extent of the deviations and their impact on meaningful interpretations
must be assessed and presented in the evaluation.

The purpose of this paper is to indicate specific information
that should be included in an evaluation report when ANCOVA-type
techniques are used, to enable the reader to accurately assess the adequacy
of the technique and the appropriateness of the evaluator's interpretation
of the results. Specific examples of the kinds of problems that arise
when collecting data in school settings are described to illustrate the
need for this additional information. Suggestions for alternative ways
of presenting the needed information in an evaluation report are included
and discussed.

The comments and suggestions made in this paper follow
primarily from a review and critique of the longitudinal evaluation of the
national Follow Through program. This evaluation is an excellent
example of the application of an ANCOVA-type analysis technique in a
typical evaluation setting. Because of its scope and duration, the Follow
Through evaluation encountered many of the problems that are typically
experienced by evaluators using this technique.

National Follow Through Program

A brief historical sketch of the national Follow Through program
and its changing purposes is needed to understand and appreciate the
methodological issues discussed in the remainder of this paper. In
1966 there were indications that Head Start, a federally funded
compensatory education program for disadvantaged preschool children, was
having some positive effects, but that the effects did not endure through
the early elementary school years. The Follow Through (FT) program
was planned as a massive service program, and was designed to extend
compensatory education (similar to that afforded the Head Start
children) from kindergarten through grade three (Johnson, 1967). When
FT was originally funded, only $15 million was appropriated for two
years, rather than the $120 million that was expected. Follow Through
then became a planned variation experiment in which diverse types of
innovative programs were implemented in various sites throughout the
U.S. Rather than assigning programs to sites or projects, participating
local districts, in cooperation with the programs' sponsors, were
allowed to select the instructional model to be implemented in their
project. Although this procedure later caused some methodological
problems, it is probably more representative of the operation of U.S.
public schools than is the random assignment of programs to sites.

In the initial two years of the FT program (1967-68 and 1968-69),
the evaluation focus was somewhat confused, due primarily to the change
in the program emphasis from service to a planned variation experiment,
and the associated administrative problems. In 1968-69, several
purposes for the national Follow Through evaluation were delineated,
including:

     1.  Assessing program impact on pupils, parents,
         schools, and community (Emrick, Sorensen, &
         Stearns, 1973, p. 72).

     2.  Assessing relative effectiveness of different
         programs and program approaches (Sorensen & Madow,
         1969, p. 4).

     3.  Establishing criteria of effectiveness and success
         of the national FT program (Sorensen & Madow,
         1969, p. 4).

In this paper, we are concerned with selected aspects of these
three purposes, which deal with the impact of the FT programs. These
purposes were accomplished, to a large extent, using ANCOVA.

Approximately 60 of the 170 local projects representing 12 of
22 FT sponsor models were included in the national FT evaluation. In
each FT school district, students identified as similar to those
participating in FT comprised the Non-Follow Through (NFT) sample, and
were tested on a regular basis by Stanford Research Institute (SRI), the
organization contracted to collect all FT evaluation data. When
comparable students could not be identified locally, a comparison or control
group from a neighboring school district was identified and tested.
Noncomparability of the FT and NFT groups at a particular site was
often a result of the school district's policy of assigning the most
disadvantaged children to the FT program. Noncomparability, for this and
other reasons, was an on-going problem in the evaluation that the use
of ANCOVA attempted to alleviate.

Decision makers associated with the early years of FT were
confident that the program would have a marked impact on the
participating children. Richard Egbert, the original FT Director, indicated
that the evaluation design was based on the [...]

     children's development would be so markedly superior
     as to be readily demonstrated on measures of
     achievement, cognition, self-concept, social maturation, and
     capacity to function independently. Follow Through's
     design was born also from the conviction that unless
     such substantial differences were manifest, the really
     massive increases in spending that would be required
     could not be justified (1973, p. 25).

These convictions seem to have resulted in less concern with details of
the design, since any reasonable evaluation of FT would readily show
the impact and effectiveness of the program.

The evaluation has vacillated in emphasis from a decision
orientation of identifying the "best" model(s) overall, to a descriptive
orientation in which different effects of individual models would be
described. Initially, SRI was awarded an evaluation contract to identify the
most effective program model(s) and to provide descriptive information
to project administrators and other school administrators. At various
times, it was decided that a consumer's guide, which would list
individual sponsor's objectives and the degree to which the objectives were
met, was to be produced by SRI. Since 1972, the major objective of the
national FT evaluation has been to identify the successful model(s) and
to document the impact of the models on pupils. An ANCOVA-type
procedure has been utilized for this purpose.

SRI and Abt Associates, the major contractors for the
longitudinal evaluation of the impact of FT, have produced four reports. The
SRI report covered the interim years of FT, 1969-71. Abt Associates
have produced three reports covering the years 1972 through 1975.
The SRI report (Emrick et al., 1973) and the most recent Abt report
(Stebbins, 1976) will be used for illustrative purposes in this paper.

Analysis of Covariance (ANCOVA)

As indicated above, ANCOVA is often used in evaluation
settings where it is difficult or impossible to control experimentally
alternative explanations of educational outcomes. In situations where
its use is appropriate, it allows groups to be compared on a criterion
variable that has been adjusted on a set of concomitant variables, or
covariates. Statistically, ANCOVA is used to increase the precision
of the analysis by taking advantage of the linear relationships between
the dependent variable(s) and the covariate(s). In order for ANCOVA
to be unambiguously used, however, its assumptions and conditions must
be precisely met. Failure to do so may distort the results in ways that
make their interpretation equivocal, if not meaningless.
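
The adjustment described above can be illustrated with a minimal sketch
for the simplest case: one covariate, two groups, and a single pooled
within-group slope. The function names and the scores are hypothetical
and are not taken from any FT report; the sketch only shows how a raw
group difference shrinks once each group mean is moved to the grand
mean of the covariate.

```python
from statistics import mean

def pooled_within_slope(groups):
    """Pooled within-group regression slope of outcome on covariate.

    groups maps a group name to a list of (covariate, outcome) pairs.
    """
    sxy = 0.0
    sxx = 0.0
    for pairs in groups.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

def adjusted_means(groups):
    """Outcome means adjusted to the grand mean of the covariate."""
    b = pooled_within_slope(groups)
    grand_x = mean(x for pairs in groups.values() for x, _ in pairs)
    result = {}
    for name, pairs in groups.items():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        result[name] = my - b * (mx - grand_x)
    return result

# Hypothetical (pretest, posttest) pairs; the NFT group starts higher.
data = {
    "FT":  [(1, 3), (2, 5), (3, 7)],
    "NFT": [(3, 8), (4, 10), (5, 12)],
}
adj = adjusted_means(data)
print(adj)
```

In these made-up data the raw posttest means differ by 5 points, while
the adjusted means differ by 1 point, because most of the raw difference
is attributable to the groups' different pretest levels. The adjustment
is only as trustworthy as the assumptions behind the pooled slope.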

We believe that the consumer of the evaluation report must be
able to: (a) assess the appropriateness of ANCOVA whenever it is used,
and (b) examine possible alternative interpretations of the results.
For these purposes, information regarding the conformity or nonconformity
to the assumptions and conditions of ANCOVA, and other information that
would enable alternative interpretations of the results to be made, must
be available in the report.

Major Areas of Concern

We have delineated some information we believe is necessary
for the reader to achieve the two purposes stated above, and we have
organized it into five topical areas. Each area is focused by one or more
questions that the evaluator should address. The national evaluation of
the FT program has been examined as a typical application of ANCOVA
in an evaluation setting. Reports of this evaluation effort will be used
to illustrate the points that we will make in each section.

     1.  How are the specific research hypotheses investigated and
         the results of the corresponding ANCOVA data analyses
         related to the general evaluation question(s)?

It is generally accepted that no empirical process completely
assesses an event, and evaluation is no exception. With limited resources,
especially of time, money, and personnel, evaluation can only address
some aspects of a general evaluation question.

     ¹ Abt Associates' evaluation report (Stebbins, 1976) discussed
several problems associated with the analysis of data collected in the FT
evaluation setting. We have selectively drawn examples from that
report to illustrate our points, and as a result, our paper tends to
emphasize only the most questionable analysis and reporting procedures in the
Abt report. Abt had the very difficult task of attempting to draw
conclusions from a complex non-experimental setting. See Appendix A for a
brief statement of Abt's view of their role and situation (Stebbins, 1976,
...).

An evaluation is defined by the specific research questions or
hypotheses that are investigated. The selection of hypotheses to be
tested or questions to be addressed is the result of a reasoning process
that links the research hypotheses to the general question. The
explication of this reasoning process, or rationale, permits the reader of the
evaluation report to identify and assess the components that are included,
as well as those that are not, in order to answer the general evaluation
question. This explication is crucial, especially in large-scale
evaluations where the inferential process relating the overall question to the
specific research hypotheses is not obvious.

One of the general impact questions delineated by SRI (Emrick
et al., 1973) for the national FT evaluation was, "How effective is
Follow Through as a method of improving the life chances of participating
children?" (p. 72). Three research questions concerned with the
academic performance of FT pupils and attitudinal changes of their
parents and teachers were delineated to address this general question.

How these academic performance and attitudinal change
variables relate to improved life chances is not immediately apparent. A
rationale that relates them is needed to enable the reader to gain an
appropriate perspective for viewing the evaluation results. Cohen and
Garet (1975) describe one line of reasoning in their article on social
policy research:

     In the late 1950's and early 1960's, for example, a
     national policy concerning educational opportunity began
     to take shape. It rested partly on the idea that poverty,
     unemployment and delinquency resulted from the absence
     of particular skills and attitudes -- reading ability,
     motivation to achieve in school and the like. There was also
     an assumption that schools inculcated these skills and
     attitudes and that acquiring them would lead to economic
     and occupational success. In other words, this policy
     assumed that doing well in schools led to doing well in
     life. (p. 21)

     ² In an evaluation the variables utilized are usually specified at
three different levels. First, the general area of focus, such as program
impact on participants or program effectiveness, is delineated. Next,
the specific aspects of the area of interest that are to be investigated
are specified as the questions to be addressed. Finally, each question
is addressed by one or more statistical analyses. We are calling these
levels: (a) the general, or overall, evaluation questions, (b) the
research questions, and (c) the statistical hypotheses, which are
operationalized by the actual data analyses carried out.

By specifying the rationale, the evaluator clarifies the
viewpoint on which the evaluation is based, and enables the reader to
understand the intentions of the evaluation. Whether the reader agrees with
the evaluator's logic or not, we believe that scrutiny of it is necessary
for the reader to assess and interpret adequately the evaluation report
and the conclusions drawn from the investigation.

The need to specify the link between the statistical research
hypotheses and the overall evaluation questions has been discussed.
Similarly, specifying the relationship between the statistical hypotheses
actually tested and the corresponding research hypotheses is needed.
Often the statistical hypotheses tested are not stated in the evaluation
report. In evaluation studies or analyses that are not complex, the
specific hypothesis that is tested can easily be inferred from a
description of the analysis performed. This is a much more difficult task when
multiple dependent and concomitant variables are analyzed, or numerous
analyses are used to investigate each research question.

Abt Associates' national evaluation of FT (Stebbins, 1976) is
a good example of a complex evaluation utilizing numerous sophisticated
analyses. An example from this evaluation follows that illustrates the
problem and indicates an approach for dealing with it. One of the
general evaluation questions addressed in their report was, "Does Follow
Through have a greater impact on disadvantaged children than do
regular school programs?" (Stebbins, 1976, p. A-8). The impact
question was addressed by a number of ANCOVA analyses comparing
FT with local, best-match, and national NFT groups. These results
are reported in what are called Summary of Effects tables (see Table 1
for sample).

Insert Table 1 about here

Fifteen analyses were made for each Follow Through site
reported, one for each variable listed in Table 1. An example of a
research question might be stated as:

     Is the mean reading achievement test score of
     participating FT students greater than that of NFT students
     when the effects of:

          a.  Fall kindergarten WRAT
          b.  First language
          c.  Family income
          d.  Highest occupation in family
          e.  Ethnic membership
          f.  Sex
          g.  Entry age
          h.  Missing data code for WRAT
          i.  Missing data code for income
          j.  Missing data code for occupation

     are statistically controlled (where reading achievement
     is defined as the Total Reading score of the Metropolitan
     Achievement Test, which is comprised of the Word
     Knowledge and Reading subtests)?

As indicated in Table 1, comparisons were made between the
FT students at each site and three different NFT groups: local,
best-match, and pooled. Nine of these comparisons deal directly with the
question of impact on reading: three each for Total Reading, Word
Knowledge and Reading. The latter two of these are subtests that make
up the Total Reading score. Of course, none of the six comparisons
involving the Word Knowledge and Reading subtests are independent
of the Total Reading comparisons, but this is not specified in the table
of effects or associated discussion.



When the specific research hypothesis addressed or the
statistical hypothesis actually tested is not stated, the reader is left with the
vague impression that everything that should have been controlled was
controlled, and the numerous comparisons reported must have assessed
the [...] comprehensively. We are
sure that the authors did not mean to leave that impression, and they
assumed that any sophisticated reader would interpret their analyses
and interpretations appropriately and with much caution -- given the
numerous caveats and explanations included in the first part of the
report. But, in any 400 page report with an additional 400 pages of
appendices, the reader will have difficulty figuring out how the scores
that define reading were obtained, what they represent, and what the
evaluators think they represent. The same problem exists for each of
the ten or more covariates.

This is a complex and difficult problem faced by every
evaluator at one time or another, and we do not want to address issues about
the role of evaluation and of evaluation reports. Our concern is that
evaluation reports describe as clearly as possible the evaluation
activities undertaken to answer the general evaluation questions, and
communicate as precisely as possible the relationships between the general
evaluation questions, the research hypotheses, and the statistical hypotheses
actually tested. Ambiguity in a massive, complex evaluation tends to
communicate to the reader that everything was done that could possibly
be done and the evaluator's conclusions are the "best" interpretations, if
not the only appropriate interpretations of the data. There are always
pressures to make the evaluation as convincing as possible, whether
positive or negative results are obtained. This often results in gross
overstatements of findings or the confidence one should have in the
findings, and does not represent well the situation that is being evaluated.
By specifying the general evaluation questions, the research and the
statistical hypotheses, and the evaluator's view of the relationships
among them, both the strengths and weaknesses of a complex evaluation
can be clarified. The limited empirical information presented in the
resultant evaluation report can then be used more appropriately by
decision makers and be more useful to educational professionals.

     2.  Are the variables defined, the rationale and the
         procedure for selecting the measures described, and are
         the relationships among the measures, variables, and
         evaluation questions specified?

In general, three relationships are of concern in the
measurement area: (a) variable/domain, (b) instrument/variable, and
(c) instrument/domain. Each of these has associated with it an
inferential gap that must be bridged in order to relate directly the empirical
results to the intended purposes of the evaluation. The rationales that
delineate these relationships must be specified in the report so the
reader can best assess the adequacy of the instrumentation.

A major issue in the measurement area is the conflicting
considerations related to the Importance and Scope (Stufflebeam, Foley,
Gephart, Guba, Hammond, Merriman, & Provus, 1971) of the data
collected and reported. "Importance" deals with emphasizing the most
important information and eliminating that which is not valued. "Scope"
is the concern about the entire range or the comprehensiveness of the
information included in the evaluation. Decisions must be made about
each possible piece of evaluative information and datum. As decisions
are made about domains, variables, and measures, practical
considerations of time, money, adequacy of measurement procedures, and the
like tend to limit the evaluation to the most important variables and
measures. At the same time, concerns about adequately fulfilling the
purposes of the evaluation tend to expand its scope.

In the national FT evaluation, two domains (cognitive and
noncognitive)³ were identified for student outcomes. Some of the
variables and measures used to assess domains are listed in Table 2.

Insert Table 2 about here

All measures used in an evaluation must be specified and
described, and the specific variables constructed from these measures
must be defined. The specification of the variables and the measures
of them can usually be done easily by using a table such as Table 2.
When the variables are defined as tests or subtests of standardized tests,
a short description of the test is usually adequate to enable the reader to
understand how each variable is being operationally defined.

Whenever an evaluation is planned, a wide range of domains
and variables are initially identified for possible inclusion. Often
domains, variables, and measures are excluded during the selection
process. The evaluation contractor is usually most knowledgeable about
the compromises and deletions that are made. A discussion of this
selection process is seldom, if ever, included in an evaluation report.
Thus, the best thinking about this problem and the rationales for the
decisions are lost to the field and to society. They are also not available
to the readers, including major decision makers in Congress, who need
that information so that they can more appropriately assess the relative
value and importance of the conclusions of an evaluation report as they
relate to decision alternatives.



     ³ SRI identified two domains (cognitive and noncognitive) as
opposed to the three domains (basic skills, cognitive conceptual skills,
and affective) identified by Abt. In this paper, we will deal only with
the cognitive/noncognitive distinction -- even though the three domains
more adequately match the different sponsors' objectives.



An example of such a vital discussion appeared in Design for
the Individualized Instruction Study (Cooley & Leinhardt, 1975b). The
first two pages of their rationale for excluding noncognitive variables in
their evaluation design are included as Appendix B of this paper. It
indicates the steps that were followed and the criteria they used to arrive
at their recommendation. In Section 3 of their report, Cooley and
Leinhardt present their rationale for using a standardized achievement
test to assess cognitive outcomes. The criteria utilized to compare
possible tests are delineated. The actual test reviews are included in an
appendix of their report, where the subtests of each achievement battery,
the psychometric characteristics, the norms available, and other
characteristics are described. However, there is very little discussion by the
authors of the inadequacies of the test battery that was to be used in the
evaluation.

This example gives a rationale for the domains, variables, and
measures to be included in this evaluation. The relationships between
the variables and domains and between the measures and variables are
described. In our view, it would have been helpful to indicate more fully
the strengths and inadequacies of the achievement battery in assessing
the specific variables and the cognitive, or achievement, domain.

Since neither Abt nor SRI described the procedures and rationales
that led to delineating the variables and measures utilized in the FT
evaluations, we have identified some aspects that seem to have been considered.

     1.  Follow Through is an attempt to extend the
         positive effects of the Head Start program. Variables
         similar to those investigated in the Head Start
         evaluation should be included.

     2.  Follow Through as a compensatory education
         program has as its primary emphasis the improvement
         of students' basic skills, which in the first three
         grades are reading and mathematics.

     3.  The sponsors have discernibly different
         approaches to early childhood education.
         Domains and variables were identified from their
         main program objectives.

     4.  Given the amount of time and money allocated for
         the FT evaluation, only the most important and
         useable variables could be investigated. Thus,
         some important variables that are of interest could
         not be included because valid and reliable measures
         of them were not available.

A discussion of each of the considerations used to make
decisions and judgments on the adequacy of the domains and variables is
needed, if the complex analyses and interpretations are to be
meaningfully understood and utilized. The evaluation report should indicate how
these and other considerations affected variable selection, and should
include the rationales for the choices made. When these considerations
are not included in a large complex evaluation, the reader is often left
with the impression that all important domains and variables were
included in the evaluation, and the measures used did adequately (and
comprehensively) represent them.

The two areas discussed above (Questions 1 and 2) are of a
general nature, i.e., they are not limited in relevancy to evaluations in
which ANCOVA is used. The remaining three areas deal directly with
the application and interpretation of ANCOVA in evaluation settings.
These three areas are concerned with (a) criteria for deciding whether
a specific analysis should be made and interpreted; (b) criteria for
deciding whether a particular covariate should be included in an analysis;
and (c) the characteristics of the groups being compared, and their
educational experiences. The boundaries of these three discussions
are somewhat arbitrary in nature, since the topics do overlap. The
functional problem of whether an analysis produces interpretable results
depends on meeting the ANCOVA assumptions, on the specific covariates
included in the analysis, and on the comparability of the groups
(questions 3, 4, and 5, discussed below).

     3.  What criteria were utilized to decide if a specific
         ANCOVA analysis should be made and interpreted?

         (a)  To what extent does each comparison meet these
              criteria?

         (b)  What are the effects on the interpretation of
              results of the failure to meet the criteria?

The primary reason that ANCOVA-type procedures are used
in evaluation settings is to adjust for, or statistically control, other
likely explanations for the outcomes that are assessed. Such
alternative explanations are still present after the analyses have been
completed, whether or not randomization was used. When there is little or
no experimental control -- such as in naturalistic field evaluations --
outcomes are more difficult to interpret and explain.

In naturalistic field settings, such as FT, deviations from the
assumptions and conditions necessary to apply and interpret ANCOVA
with some degree of precision are often present. The evaluator must
attempt to assess the extent of the deviations and to delineate criteria
for deciding when a specific analysis should not be interpreted.
Establishing the specific values for criteria is an admittedly subjective process,
as there is little guidance available in the literature.⁴ These values
must be based on the purposes of the evaluation and on the specific
situations in which the evaluation is occurring. In this section, we discuss
several criteria that should be considered, and how their utilization might
be reported.



     ⁴ Implicit in much of the literature on ANCOVA is that it should
not be used in non-experimental situations, but this is extreme.
Selective application and cautious interpretation are a more practical and
useful approach to using this and other statistical methods.

The first consideration that must be made, especially in a
longitudinal evaluation, is whether the data in hand are representative of
the situation being evaluated. In the FT evaluation, attrition was often
a major problem. After three to four years, less than 25 percent of the
initial FT sample had complete data in some sites. A criterion that was
implemented by the Abt evaluation team was that both the FT and NFT
groups be comprised of at least 12 students. This small sample size
would, of course, overfit the statistical model, especially when seven
to ten covariates were used, but at least the criterion value (12) was
explicitly stated. The actual sample size of the samples included in
specific analyses can usually be presented in a table summarizing the
results (or "effects", in Abt's terminology). In addition to sample size,
this table should report what proportion of each site's participating
students comprise its sample. The size of this proportion directly influences
the appropriateness of our conclusions drawn from the analyses. The
question of proportion of participants needed is a related but complex
concern that will not be addressed here, but should be considered in each
evaluative setting. Other considerations that might be of concern are
discussed in the final section of this paper.

A second set of criteria are the assumptions of ANCOVA. The
four considerations that are usually important were identified in Abt's
FT report:

     1. The covariates are uninfluenced by treatment;

     2. The distribution of the covariates is not grossly
        different across groups;

     3. The relationships between covariates and criteria
        are the same (homogeneous); and

     4. The covariates are perfectly reliable.
        (Stebbins, 1976, p. A-58)

Each of these assumptions was investigated or discussed in
Abt's evaluation report. The first assumption was investigated by
rerunning ANCOVA comparisons without the one covariate (WRAT) that
they felt could be influenced by the treatment. Violation of this
assumption meant that the portion of the treatment effect that was
confounded with WRAT was being inappropriately removed. The report
states:

     If the WRAT is influenced by the first few weeks of
     treatment, one might expect pretest adjustments to
     handicap the FT children. To test this we removed the
     WRAT from the covariate set and reran the local analyses.
     The results of these "no-WRAT covariate" analyses
     do not differ in any important ways from analyses
     which included WRAT as a covariate. We conclude from
     this comparison that the WRAT is probably not hindering
     our analyses of program effects.
     (Stebbins, 1976)

The rerunning of the ANCOVA analyses for each site was useful
in addressing whether the use of the available WRAT data as a
covariate affected the results obtained and conclusions reported. It is
important to note that dropping any one of ten interrelated covariates
is unlikely to affect the results of an analysis. Dropping WRAT does
not directly address the assumption that the covariates were not
influenced by treatment. This assumption could be attended to more
directly by giving the pretest earlier, such as in the previous year,
before the program was implemented, or in the first two weeks of
program implementation. Or, it could be addressed in a pilot study
before the evaluation is undertaken.

Our own experiences at the Learning Research and Development
Center (LRDC) have convinced us that much learning, as assessed by
paper and pencil achievement tests, occurs in the first few weeks of our
program. In a study in Pittsburgh area schools (Eichelberger,
DiCostanzo, & Evaluation Staff, 1975), students using the LRDC curricula
similar to that used in FT were assessed in the sixth week of school (as
were a group of similar students) using the Metropolitan Readiness Test
(MRT). The results obtained are reported in Table 3. These results
indicate that the three schools using the LRDC curricula during the
first five weeks of kindergarten scored much higher than similar
students who had not used those curricula. These results suggest
problems with the assumption that the fall kindergarten WRAT scores were
unaffected by six weeks of treatment.

                    Insert Table 3 about here

If the pretest was differentially affected by the treatments in
FT, use of the ANCOVA-like procedures to test other assumptions and
to adjust FT and NFT group differences in that evaluation might result
in inappropriate conclusions. Elashoff (1969) suggested that analysis
of variance be conducted on the covariate to test the assumption, but
in the situation where testing occurs after four to six weeks of school,
that procedure does not directly address the issue of the comparability
of the groups prior to treatment. The implications of the failure to meet
the assumption have not been well delineated at this time, and deserve
more careful consideration by evaluators using this technique.

If the evaluator's concern is to assess the effect of using a
specific covariate (such as the WRAT) on the results obtained, then
rerunning the analyses (without the WRAT) is useful. Whenever the
"no-WRAT covariate" analyses result in changes in conclusions for a
specific comparison, those results should be reported. The two sets of
ANCOVA analyses might be performed at some specified level of
significance (such as .05), and presented in a way that would reflect
the different results that were obtained. When such a large number of
reanalyses are made, it is, of course, important to note the number of
differences found significant as a proportion of the total number of
comparisons made within each program or site.
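
The with-and-without comparison described above can be sketched in a
few lines. This is a minimal illustration and not Abt's procedure: it
assumes a single FT/NFT indicator coded 1/0, fits ordinary least
squares through the normal equations, and returns the adjusted group
effect with the full covariate set and with one covariate dropped. The
function names and the data layout are hypothetical.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting; a singular design raises ZeroDivisionError below.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def group_effect(outcome, group, covariates):
    """Adjusted FT-minus-NFT difference: coefficient on the 1/0 indicator."""
    X = [[1.0, float(g)] + [float(c) for c in row]
         for g, row in zip(group, covariates)]
    return ols(X, [float(v) for v in outcome])[1]


def effect_with_and_without(outcome, group, covariates, drop):
    """Return (effect with all covariates, effect with column `drop` removed)."""
    reduced = [[c for j, c in enumerate(row) if j != drop] for row in covariates]
    return (group_effect(outcome, group, covariates),
            group_effect(outcome, group, reduced))
```

Reporting both numbers side by side for every comparison, rather than
only a summary statement that the results "do not differ", lets the
reader see where dropping the covariate changes a conclusion.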

The second assumption we will discuss, that the covariates
were measured without error, has been theoretically studied, but
what is known has seldom been applied to evaluation studies.
Conclusions about educational programs drawn from empirical data may not
represent the situation because of sampling error, which is estimated
from repeated use of the same measurement procedures. The conclusions
may also be misleading about specific variables, such as academic
achievement, because the measures inadequately assess important
aspects of the variables. Neither of these problems can be solved with
great confidence in an applied setting, so the complex adjustments are
not attempted and the inadequacies in measuring the variables are
overlooked.

There is also a tendency to overlook what Coleman, Campbell,
Hobson, McPartland, Mood, Weinfeld and York (1966) called measurement
error. Measurement error

     . . . includes such errors, among others, as
     ambiguities in definitions and in the questionnaire,
     failure to obtain required information from respondents,
     obtaining inconsistent information, mistakes in clerical
     coding and editing, errors occurring during the machine
     processing operation, and tabulation errors.
     (Coleman et al., 1966, p. 561)

In other words, it cannot be assumed that demographic and other
"concrete" descriptive data are measured without error.

In Abt's FT evaluation the authors indicated that:

     Variables such as sex, ethnicity, income, occupation,
     education, language, and age are all measured
     with a minimum of error. It is only the pretest which
     poses a problem. (Stebbins, 1976, p. A-60)

With problems that exist in most self-report data, especially about
variables like income and occupation among the low SES group, it is
important that estimates of these data's reliability be obtained and
reported when they are used as covariates (see Elashoff, 1969; Lord,
1962).

An illustration of the difficulties that often arise in measuring
what seem to be absolute entities occurred in a study at LRDC. The
size of each classroom utilizing the LRDC program was to be measured
by making a sketch of the classroom area with the dimensions specified.
Usually, this was done only once, because we felt we could reasonably
assume that it is measured with a minimum of error. On one occasion,
the measurement of the classrooms was asked for again in the same year.
The results were not at all consistent, with shapes as well as
dimensions changing. This experience has made us extremely cautious
about the accuracy of all types of data regardless of their presumed
simplicity.

Coleman et al. (1966), in the "Equality of Educational
Opportunity" study, empirically investigated the systematic measurement
error that resulted from selected parts of their procedures. Evaluators
of all major longitudinal studies should consider estimating and
reporting the measurement error associated with their data.

In Abt's FT evaluation, the degree of error in the WRAT pretest
was investigated. The report stated that:

     The reliability of the pretest was calculated by each
     Follow Through Sponsor-level sample by a measure
     of internal consistency (coefficient alpha) and is on
     the order of .90 across these samples (see Appendix
     Table A2-1). (Stebbins, 1976, p. A-60)

It is, of course, important to report the specific values for each
group on which an analysis is run, because any time the value is low,
the conclusions drawn from it must be interpreted with much caution.
Even though 96 percent of the groups have very high reliability, specific
sites may fall within a large range of values. The reader needs to know
these values for specific analyses. It would be helpful if the evaluator
initially set a reliability level (such as .80) below which the covariate
would not be used. This does not preclude a later decision to include a
covariate that does not meet the criterion value, if there are unique
compelling reasons to do so. The reliabilities and associated cautions
should be reported, or at least noted, in the text where the conclusions
that they affect are reported.
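
A per-group reliability check of this kind might look like the sketch
below. It computes coefficient alpha, the internal-consistency measure
named in the Abt quotation, from item-level scores and flags each group
against a preset criterion. The .80 default is the example value
mentioned above; the function names and the input layout (one row of
item scores per student) are assumptions made for illustration.

```python
from statistics import variance

RELIABILITY_CRITERION = 0.80  # example criterion value from the text


def coefficient_alpha(item_scores):
    """Coefficient alpha from rows of item scores (one row per student)."""
    k = len(item_scores[0])
    item_variances = [variance([row[i] for row in item_scores])
                      for i in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def reliability_report(samples, criterion=RELIABILITY_CRITERION):
    """Map each group name to (alpha, meets_criterion)."""
    report = {}
    for name, item_scores in samples.items():
        alpha = coefficient_alpha(item_scores)
        report[name] = (alpha, alpha >= criterion)
    return report
```

A table built from `reliability_report` gives the reader the specific
value for every group analyzed, not only the typical value across
groups.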

⁵ We are unable to find the reliabilities for the different
Sponsor-level samples in Table A2-1 of our copy of the Abt report, so we
do not know the actual range of values obtained. But in any large complex
report, there will be some errors and omissions.

⁶ See Glass, Peckham, & Sanders, 1972, for reviews of studies
that investigated this concern.

A related point that must be raised at this time is the
questionable use of coefficient alpha as an estimate of error in a
covariate, such as the WRAT. This is not intended as criticism of Abt's
use of it, given the type of data available to them and their general
situation. In fact, Abt's attempts to investigate the adequacy of the FT
data for the ANCOVA model are to be commended. But it is important that
better methods be identified and utilized for testing the assumptions
and setting appropriate criteria. Tukey (1954) and Wold (1956)
explicated problems that arise as data analysis moves from experimental
to observational data, of which every researcher must be aware. Work
on these problems is needed for clarification of their implications for
decision-oriented research.

As previously indicated, no criteria related to the first and
fourth assumptions were specified and used in the national FT
evaluation, although both assumptions were addressed. The three criteria
that the Abt evaluation used and reported to indicate that "the
adjustment produced by ANCOVA may be misleading" (Stebbins, 1976,
p. A-71) were:

     1. when the relationship between a given covariate
        and outcome is different for the treatment and
        comparison groups being analyzed;

     2. when the pretest difference between the treatment
        and comparison groups being analyzed is greater
        than five points (about one-half of a standard
        deviation); and

     3. when the percent of those attending preschool in
        each group differed by more than 50 percent.
        (p. A-72)

The first criterion is essentially the ANCOVA assumption that
the treatment groups have a common regression surface. The second is
an indication that the treatment groups were not drawn from populations
with the same covariate distributions. The third criterion could also
be viewed as questioning the assumption that the groups were initially
equivalent.⁷

When any of these conditions existed for a comparison, or set
of comparisons, their corresponding results were "greyed-out" in the
effects table. The specific criterion values that were used in
greying-out a particular comparison are presented in the text of the
report, and which of these were violated in an analysis is specified in
the effects tables. This is vastly superior to presenting all
information in a table with no indication that some of the results are
questionable. In fact, in certain situations it may be more appropriate
to report only the results of analyses that do meet all of the minimum
criteria, rather than merely "greying-out" certain results.

Note that these three criteria have associated with them explicit
values or decision rules (such as statistical significance). Although the
reader may disagree with the specific values set by the evaluator, s/he
knows when the evaluator thinks the results are interpretable and what
the specific criteria are on which the decisions are based. Testing for
the violation of the conditions and assumptions needed for meaningful
interpretation of ANCOVA results should be done in an experimental
setting; however, such testing is imperative in an evaluation or
naturalistic setting, where naturally confounded variables are almost
certain to be present, and little control of the situation is possible.
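
Explicit decision rules of this kind are easy to state in executable
form. The sketch below encodes the three criteria quoted above as a
function that returns the reasons a comparison would be greyed-out. It
is an illustration only: the function and argument names are
hypothetical, the slope-homogeneity result is assumed to come from a
separate significance test, and the preschool criterion is read here as
a difference in percentage points, which is an interpretation of the
quoted wording.

```python
def grey_out_reasons(slopes_differ, ft_pretest_mean, nft_pretest_mean,
                     ft_preschool_pct, nft_preschool_pct,
                     max_pretest_gap=5.0, max_preschool_gap=50.0):
    """List the criteria (if any) that a single FT/NFT comparison violates."""
    reasons = []
    # Criterion 1: outcome of a homogeneity-of-regression test run elsewhere.
    if slopes_differ:
        reasons.append("covariate-outcome relationship differs between groups")
    # Criterion 2: pretest gap larger than five points.
    if abs(ft_pretest_mean - nft_pretest_mean) > max_pretest_gap:
        reasons.append("pretest difference exceeds criterion")
    # Criterion 3: preschool attendance gap (assumed: percentage points).
    if abs(ft_preschool_pct - nft_preschool_pct) > max_preschool_gap:
        reasons.append("preschool attendance difference exceeds criterion")
    return reasons
```

An empty list means the comparison meets all three criteria; a
non-empty list can be printed beside the result in the effects table so
the reader sees which criterion was violated.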

Considerations should not be limited, however, to those
discussed and dealt with in this paper, or in the Abt report. Others
that may be relevant in your setting may not have been included in the
FT evaluation. For example, can the criterion-covariate regression be
expressed in linear form for each treatment group? If this condition is
violated, i.e., the data cannot be transformed to linear form, comparison
of two estimated treatment means will be biased. Elashoff notes "that
the effect of nonlinearity is most severe when random assignment to
groups is not possible or protection against non-normality in the y's is
lowest" (1969, pp. 390-391). This condition can be tested by examining
increases in explained variation when higher order terms are included in
the regression equation. Again, the evaluator should specify the exact
criterion value for the test.

⁷ See Campbell and Erlebacher (1970) for how this can lead to
erroneous conclusions.
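
The linearity check described above, examining the increase in
explained variation when a higher order term is added, can be sketched
for one covariate within one treatment group as follows. This is a
minimal illustration under stated assumptions: only a squared term is
considered, the gain is computed by the standard device of regressing
residuals on residuals, and the function names are hypothetical. What
size of gain should count as a violation is left to the evaluator's
stated criterion.

```python
def _centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _residuals(y, x):
    """Residuals from the simple regression of y on x (with intercept)."""
    yc, xc = _centered(y), _centered(x)
    slope = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
    return [b - slope * a for a, b in zip(xc, yc)]


def r_squared_gain(x, y):
    """Return (R2 with x only, R2 with x and x**2, gain from the squared term)."""
    total = sum(v * v for v in _centered(y))
    e_y = _residuals(y, x)                     # outcome net of the linear term
    e_q = _residuals([v * v for v in x], x)    # x**2 net of the linear term
    sse_linear = sum(e * e for e in e_y)
    # Variation additionally explained once x**2 enters the equation.
    extra = sum(a * b for a, b in zip(e_y, e_q)) ** 2 / sum(b * b for b in e_q)
    r2_linear = 1 - sse_linear / total
    r2_quadratic = 1 - (sse_linear - extra) / total
    return r2_linear, r2_quadratic, r2_quadratic - r2_linear
```

The covariate needs at least three distinct values for the squared term
to carry any information beyond the linear term; otherwise the final
division fails.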

In summary, the evaluator of any program working in a
naturalistic setting will find that the data available will deviate from
assumptions and conditions necessary for easy interpretation of ANCOVA
results. By recognizing this fact before implementing the evaluation,
detailed guidelines can be designed based on the purpose of the
evaluation and the evaluation setting. As more data and knowledge are
gained from a program, especially a longitudinal one, modifications in
these criteria may be necessary. But these guidelines with their
associated rationales, their subsequent changes, and their implications
for the conclusions drawn are needed by the reader to understand and
interpret the results of an evaluation.

4. What criteria were utilized to determine if a covariate
   was to be included in a specific analysis?

Too often covariates are included indiscriminately in a set of
ANCOVA analyses without knowledge of the local conditions or a theory of
how the variables interrelate. This usually results in conservative
estimates of treatment effects, due to confounding of the covariates
with the treatment. We believe that the selection of covariates for an
analysis should be based on a logical rationale, preferably one that is
a part of a broader theoretical framework. Presenting the logical
process used to identify candidate covariates indicates that the
evaluator has broadly conceptualized the evaluation problem. Also,
unique conditions in a specific situation often require decisions to be
made about the inclusion or exclusion of a covariate in a specific
analysis. In addition to a theoretical basis for including covariates,
guidelines for excluding them are needed. These guidelines for including
or excluding covariates are usually based on the several assumptions of
ANCOVA listed in the previous section.

All covariates that were assessed in the evaluation and
considered for use in a specific analysis should be listed in the report.
The rationales for their inclusion should also be presented. This
point was discussed extensively in section two of this paper.

In large complex evaluations, numerous covariates are
assessed, but the specific ones used in different analyses often vary.
This variation is usually related to one of the ANCOVA assumptions
or conditions unique to the situation. When the results of an ANCOVA
analysis are presented, a list of the covariates considered for use and
the reason for excluding any from the analysis should be reported.
A hypothetical example of such a list is presented in Table 4.

                    Insert Table 4 about here



The criteria to decide whether or not a covariate should be
included in a comparison should not be limited to the assumptions of
ANCOVA. For example, one criterion not utilized in the Abt report
was the degree of relationship between the covariate and outcome
variable. This is an important factor for assessing the effectiveness of
the covariate. Cox (1957) compared the precision of blocking versus
covariance for different values of ρ, i.e., the correlation between the
covariate and outcome variable. Cox concluded that if ρ < 0.4,
blocking is preferable to covariance analysis; if ρ > 0.6, covariance
is somewhat better; and if ρ > 0.8, covariance analysis is appreciably
better. Although other factors affect this relationship, e.g., shape of
covariate distributions, it is an important consideration that should be
used as a criterion for judging potential covariates. This consideration
is, of course, secondary to the purpose of the evaluation and the
specific questions being addressed in any study.
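
As a rough illustration of how the thresholds attributed to Cox above
could serve as a screening criterion, the sketch below computes the
covariate-outcome correlation and maps it onto those ranges. The
function names are hypothetical, applying the thresholds to the
magnitude of the correlation is an assumption, and the text gives no
guidance for values between 0.4 and 0.6, so that range is reported as
such.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between covariate x and outcome y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cox_guideline(rho):
    """Map a correlation onto the ranges quoted in the text."""
    size = abs(rho)  # assumption: only the magnitude matters
    if size > 0.8:
        return "covariance analysis appreciably better"
    if size > 0.6:
        return "covariance somewhat better"
    if size < 0.4:
        return "blocking preferable"
    return "no preference stated in the text"
```

Listing the correlation and the resulting label beside each candidate
covariate makes this part of the selection rationale visible to the
reader.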



The general information on choosing covariates that would
appear in a table similar to Table 4 should be supplemented with the
specific values of correlation coefficients for the selected covariates.
This could be efficiently incorporated into a table exhibiting the
information necessary for the reader to reconstruct the regression
equations of the comparisons that were actually made. A pertinent
illustration can be found in the appendices of the SRI report (Emrick
et al., 1973). In Table 5, the correlation with the dependent variable,
the raw regression weight, the standardized regression weight, and the
standard error of the regression coefficient are listed for each
covariate. This should be accompanied by information describing and
elaborating on the ANCOVA comparison made, such as unadjusted and
adjusted outcome variable means, the standard error of the adjusted
difference, the sample sizes, and the actual results of the comparison
in the form of a computed statistic or confidence interval.

                    Insert Table 5 about here
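
To make the reporting items concrete, the sketch below computes the
unadjusted and adjusted outcome means for the simplest case of a single
covariate, using the pooled within-group regression weight. It is an
illustration under that one-covariate assumption, not a reconstruction
of the SRI or Abt analyses, and the function name and input layout are
hypothetical.

```python
def adjusted_means(groups):
    """groups maps a group name to (covariate values, outcome values).

    Returns the pooled within-group slope and, for each group, its sample
    size, unadjusted outcome mean, and covariance-adjusted outcome mean.
    """
    all_x = [v for x, _ in groups.values() for v in x]
    grand_x = sum(all_x) / len(all_x)
    sxy = sxx = 0.0
    means = {}
    for name, (x, y) in groups.items():
        mx, my = sum(x) / len(x), sum(y) / len(y)
        means[name] = (mx, my)
        sxy += sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx += sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx  # pooled within-group regression weight
    table = {
        name: {
            "n": len(groups[name][0]),
            "unadjusted": my,
            # Move each group's mean to the grand covariate mean.
            "adjusted": my - slope * (mx - grand_x),
        }
        for name, (mx, my) in means.items()
    }
    return slope, table
```

Only the difference between the adjusted means enters the significance
test, so the table is most useful when the adjusted difference is shown
alongside the sample sizes and the regression weight.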

To summarize, we have recommended that the evaluator
delineate a logical rationale for the selection of variables as candidate
covariates, preferably based on an overall theoretical framework.
Guidelines should be delineated for deciding when covariates should
not be included in an analysis. The criteria specified should include,
but not be limited to, the assumptions of ANCOVA. The results of
this decision process could be presented in a table similar to Table 5.
Finally, the types of data that should be reported for each covariate
and comparison that allow the reader to assess the interpretations
derived from the ANCOVA comparisons have been discussed. In large
complex evaluation reports, we have often found it very difficult to
identify the variables that were included in an analysis, let alone find
the reasons why a particular variable was or was not included. Criteria
that might be used to decide which covariates to exclude from an
analysis, and methods for clearly presenting the associated information
in an evaluation report, need further investigation and development.

⁸ Some indication that the adjusted means have no intrinsic
value, and that comparison of the means with their associated
unadjusted means is not usually meaningful, should be included in the
report. Of course, the difference between the adjusted means is used
in the computation of statistical significance.

5. Are the different groups included in the ANCOVA
   analyses and their educational experiences described
   adequately in the report?

Much of the technical information required to assess the
appropriateness of the interpretation of the evaluation results has been
specified in the preceding sections. We have commented on the need
for the evaluator to: (a) link the data analyses and research hypotheses
to the general evaluation questions, (b) specify and link the measures
and variables utilized in an evaluation to the domains of interest, (c)
state the criteria used to decide if an analysis should be made and
interpreted, and (d) specify the criteria used to select covariates for a
specific analysis. In addition to these concerns, several others
pertaining to the groups' characteristics and experiences are needed for
the conclusions to be interpreted appropriately.

Knowledge of the educational conditions and treatments that
the different groups experienced is of central importance in
interpreting the results of any program evaluation. A detailed
description of the programs experienced by students is a major
undertaking, as evidenced by the extensive work in FT of Stallings (1973)
and Cooley and Leinhardt (1975a). Obviously, the extensiveness of the
program descriptions represented by these studies usually cannot be
achieved when conducting an impact evaluation, but the identification of
some essential context and program variables should be made by
evaluators in any setting. Ignoring differences between intended
treatments and those actually experienced, and between characteristics
of the experimental and comparison groups, can lead to erroneous
conclusions about the relative impact of the variables being evaluated.
Also, little or no knowledge is gained about how the obtained outcomes
had been affected by important program variables.

Follow Through again provides a relevant example. The FT
evaluation was intended to assess the impact of the FT program on
participating children, as compared to the impact of "regular" school
experiences that did not include innovative educational programs.
However, the "regular" school programs serving the NFT comparison
children often included other compensatory programs, such as Title I,
which at times utilized educational materials and practices similar to
those in some Sponsors' FT instructional models. As a result, when
a FT/NFT comparison is made, the appropriate interpretation of the
results is not immediately apparent. Differences between FT and NFT
groups and their educational experiences must be integrated with the
reporting of results. A hypothetical example might be, "the NFT
children at the Oshkosh, Alaska, site were similar to the participating
FT children at the site on all entry characteristics measured. Because
the NFT children were from families whose incomes were very low,
they qualified and participated in the Title I federal compensatory
education program. This involved supplemental instruction in arithmetic
and reading and additional aid. . . ." This type of information is needed
by the reader to interpret the results with respect to the educational
variables actually being assessed, and the degree to which differences
in outcomes might be expected.

In addition to considerations about the comparability of the
educational conditions and materials the different groups experience,
the evaluator must report information about the similarities and
differences between the groups experiencing the program being evaluated
and those comprising the comparison group. In previous sections,
the necessity to report raw and adjusted means on the covariates and
the dependent variables was noted. Suggestions of how and where to
report the information were also indicated. Other aspects of each
unique evaluation setting must also be taken into account. Within FT,
some of these considerations relate to attrition and missing data,
program requirements for participation, and local implementation and
utilization of the program.

Attrition and missing data commonly affect the final composition
of the groups being compared. Attrition occurs when a participating
student moves out of the FT classroom. Missing data occur when one
or more measurements for a participating student are missing. Due to
these two factors, the composition of groups in the FT evaluation has
been shown to undergo drastic changes during the course of a four year
educational program. For example, the Abt report states that
"approximately 50 percent of the FT and NFT children who were tested
in the kindergarten year of Cohort II were not present at the end of
third grade" (Stebbins, 1976, p. A-47).

Empirical investigation can be utilized to determine whether
attrition or missing data bias a comparison. The Abt evaluators
compared attrition rates for FT and NFT students at each site, using
their pretest scores and family income data. Five sites were found for
which attrition significantly changed the difference between groups'
pretest scores, and three sites were identified for which attrition
altered the FT/NFT difference in mean income. No explanation was given
in the report for the selection or limitation of the investigation to
these two variables.
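
One simple way to carry out such an investigation is to compare the
FT/NFT difference on an entry measure for the initial sample with the
same difference among the students who remained. The sketch below does
only that descriptive step; it does not test significance and is not
the Abt procedure. The function names and the (score, retained) input
layout are hypothetical, and the same function could be applied to
income or any other entry variable.

```python
def mean(values):
    return sum(values) / len(values)


def attrition_shift(ft, nft):
    """ft, nft: lists of (entry score, retained) pairs, one per student.

    Returns (initial FT-NFT gap, gap among retained students, change in gap).
    """
    def gap(keep_all):
        ft_scores = [s for s, kept in ft if keep_all or kept]
        nft_scores = [s for s, kept in nft if keep_all or kept]
        return mean(ft_scores) - mean(nft_scores)

    initial, retained = gap(True), gap(False)
    return initial, retained, retained - initial
```

A change in the gap that is large relative to the stated pretest
criterion suggests that attrition has altered the comparability of the
groups at that site.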

A procedure was used in the Abt report to estimate values for
the missing data for covariates. Whether or not a covariate value was
estimated was then noted in the analysis. Several advantages to this
procedure were noted in the Abt report:

     it avoids the risk of nonrepresentativeness due to
     dropping children;

     it avoids the loss of statistical power due to
     reduced sample size;

     it uses the information contained in the absence or
     presence of the variable; and

     it uses the information present on other variables
     for children who might have been dropped otherwise.
     (Stebbins, 1976, p. A-51)

In any large-scale longitudinal evaluation, the evaluator will
have the task of selecting from numerous alternative approaches for
handling missing data, including dropping such persons from all
analyses. Each situation will dictate considerations that will influence
the decision rules for handling missing data. We suggest that these
rules and their rationales be made explicit. How the estimation of
missing data affects the assumption that the measures are perfectly
reliable, and how the interpretation of the results might be affected,
must be considered.

Federal requirements for the FT program also affected the
composition of the FT and NFT groups:

     Children enrolled in early elementary grades may
     participate in [FT] projects. . . . At least 50 percent
     of the children in each entering class shall be children
     who have previously participated in a full-year Head
     Start or similar quality preschool program and who
     were low income at the time of enrollment in such
     preschool program. . . . ("Follow Through Program",
     1975, 11714-11715)

As a result, entering kindergarten children could not be randomly
selected for participation in FT. At some sites, those students below
poverty level were assigned to the FT classroom while students from
higher income families were assigned to the regular classrooms and
often became part of the NFT comparison groups at the site. The
descriptive data do indicate the existence of this systematic bias
caused by program requirements (see Table 6).

                    Insert Table 6 about here

Local decisions about implementation and utilization of the FT
program are more difficult to document, but no less a problem for
adequate interpretation of results. For example, local administrators
often used FT as a remedial program. Students who were repeating a
grade or who had special needs were often placed in the FT classroom.
The Abt report also indicated that a systematic bias exists against
the FT group:

     In most cases the Follow Through participants were
     selected from among the "most difficult" in the
     community. . . . Some communities chose to include the
     mentally handicapped and/or emotionally disturbed.
     (Stebbins, 1976, pp. A-12-15)

Although this was not the case at all sites, it does document
that the "more difficult" students were placed in FT classes. The
effects of these differences, such as the inclusion of emotionally
disturbed children and grade repeaters, usually remain unknown because
they are not assessed by the covariates and are not investigated in
other ways either.

These examples emphasize the need for detailed descriptions
of the groups being compared, and their educational experiences.
This information should be coordinated with the reporting of results
at the site level, since program interpretations depend upon the
similarity of the groups. The results section of the Abt report did
describe FT/NFT group comparability both in tabular and prose forms.
For example, the FT and NFT groups at a particular site were
described in the results section for that site:

     The FT group is also well below sponsor average in
     income, while the NFT is about average for this
     sponsor. . . . The two groups are a fairly close match
     on entry WRAT and ethnic composition, though the NFT
     income level is considerably higher than the FT level.
     (Stebbins, 1976, p. A-195)

This information permits the reader of the report to make more
appropriate interpretations of conclusions and other summary
statements made by the evaluator about a specific site by making him/her
aware of the similarities in entry WRAT and ethnic composition and
the considerable difference in income.

                              Summary

The purpose of this paper was to indicate some specific information that should be included in an evaluation report when ANCOVA-type techniques are used, in order to allow the reader to assess the adequacy of the analyses and the appropriateness of the evaluator's interpretations of the results. A recurrent theme of this paper has been that the evaluator must recognize that the application of ANCOVA in evaluation settings requires a more elaborate analysis and reporting strategy than in experimental studies, due to the failure to meet assumptions of ANCOVA precisely and the existence of numerous plausible alternative interpretations of the results. The evaluator must recognize that important aspects of the evaluation should be described in detail, e.g., setting, treatments, characteristics of the participants and nonparticipants, and their educational experiences.

The major points elaborated in the paper are summarized below. Those activities that evaluators often fail to carry out, or criteria that are sometimes not specified in the section of the report where they could be most useful, are emphasized:

1. Specify the hypothesis actually tested by an analysis, rather than only relating the analysis to a general evaluation question.

2. Describe the variables used in each analysis, the rationale and procedure for selecting the measures, and the relationships among the measures, variables, and evaluation questions.

3. Use explicit criteria to decide whether or not to make a specific analysis, and report the extent to which an analysis meets those criteria.

4. Use explicit criteria to decide whether to include a covariate in a specific analysis.

5. Describe the groups included in the ANCOVA-type analysis by reporting:

   a) adjusted and unadjusted raw or standard means on the dependent variable(s) for each group,

   b) summary statistics for each group on the covariates used in each analysis, and

   c) a detailed description of the educational experiences of the program group and of any comparison groups.

These points were made in an effort to reduce the ambiguity that often ensues when reporting the results of ANCOVA techniques in complex longitudinal evaluations. These considerations are intended to improve evaluators' abilities to communicate their findings accurately to the nonstatistically-oriented reader. A special effort is needed to indicate the limitations of an evaluation as well as its strengths, so that a more balanced and accurate picture of a program and its effects is presented to the decision maker, who may be puzzled and awed by the mathematical procedures.
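
The reporting called for in point 5 can be illustrated with a small sketch. It assumes the simplest classical case: one covariate and a common (pooled) within-group regression slope, so that each group's adjusted mean is its unadjusted mean minus the slope times the distance of the group's covariate mean from the grand covariate mean. The function name `ancova_adjusted_means`, the dictionary keys, and the scores are hypothetical and are not taken from the Abt report or any FT site; this is not a reconstruction of the procedure used in the national evaluation.

```python
from statistics import mean


def ancova_adjusted_means(groups):
    """Return (pooled_slope, per-group summary) for a one-covariate ANCOVA.

    `groups` maps a group label to a list of (covariate, outcome) pairs.
    Assumes a common within-group slope (an assumption, not a tested fact).
    """
    summary = {}
    within_sxy = 0.0  # pooled within-group sum of cross-products
    within_sxx = 0.0  # pooled within-group sum of squares on the covariate
    all_x = []

    for label, pairs in groups.items():
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        x_bar, y_bar = mean(xs), mean(ys)
        within_sxy += sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        within_sxx += sum((x - x_bar) ** 2 for x in xs)
        all_x.extend(xs)
        summary[label] = {
            "n": len(pairs),
            "covariate_mean": x_bar,
            "unadjusted_mean": y_bar,
        }

    if within_sxx == 0:
        raise ValueError("covariate has no within-group variation")

    slope = within_sxy / within_sxx
    grand_x = mean(all_x)
    for stats in summary.values():
        # Adjust each group mean to the grand mean of the covariate.
        stats["adjusted_mean"] = (
            stats["unadjusted_mean"] - slope * (stats["covariate_mean"] - grand_x)
        )
    return slope, summary


# Hypothetical (entry score, outcome score) pairs for two groups.
example = {
    "FT": [(20, 40), (25, 45), (30, 50)],
    "NFT": [(30, 48), (35, 53), (40, 58)],
}
slope, summary = ancova_adjusted_means(example)
for label, stats in summary.items():
    print(label, stats)
```

In this made-up example the unadjusted outcome means favor the NFT group (53 versus 45), while the adjusted means favor the FT group (50 versus 48), because the FT group entered 10 points lower on the covariate. A reader can only judge such a reversal, and how far the adjustment extrapolates, when the unadjusted means, adjusted means, and covariate means are all reported for each group.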
Table 1

Sample Summary of Effects Table
(Stebbins, 1976, p.

SITE A    SITE B
Local    Matched    Pooled

Total Reading            + + +
Total Math               -
Spelling
Language
Raven's
Coopersmith
IARS (+)
IARS (-)
Word Knowledge           + + +
Reading
Math Concepts
Math Computations
Math Problem Solving
Language Part A
Language Part B


Table 2

Domains, Variables, and Measures Used in
National Evaluation of Follow Through Program*

Domain          Variable                       Measure

Cognitive       Total Reading                  Metropolitan Achievement Test
                Total Math                     Metropolitan Achievement Test
                Spelling                       Metropolitan Achievement Test
                Language                       Metropolitan Achievement Test
                Problem Solving                Raven's Progressive Matrices

Non-Cognitive   Self-Concept, or Self-Esteem   Coopersmith
                Locus of Control               Individual Achievement Responsibility Scale

*These were used to assess third grade effects in the Abt evaluation (Stebbins, 1976).



Table 3

Fall Metropolitan Readiness Test Results
for Kindergarten Students in LRDC
and Comparison Schools

School                 Fall Mean    N

LRDC School 1          36.22        63
Comparison School 1    28.32        [illegible]

LRDC School 2          37.19        42
Comparison School 2    22.69        42

LRDC School 3          29.3         87
Comparison School 3    23.03        130

Table 4

Hypothetical Table of Covariates Considered for
an ANCOVA Analysis and Reasons for Dropping Those
That Were Excluded

Covariate                 Criteria Failed    Covariates Included

Fall Kindergarten WRAT    1*, 4
Preschool Experience      2
Sex                                          X
Ethnic Membership                            X
Occupation                4

*The numbers refer to the assumptions of ANCOVA listed in section 3.




Table 5

COHORT I, KINDERGARTEN. REGRESSION ON CONTROL VARIABLES:
PUPIL OUTCOMES FOR PROJECT ANALYSES (N = 3^^, RESIDUAL df = 25t)

Covariates (Fall, 1969):
QUANT. PRESCORE
COG. PROCESS PRESCORE
READING PRESCORE
LANGUAGE PRESCORE
AFFECT PRESCORE
AV. PUPIL AGE (MONTHS)
% CLASSROOM MALE
% CLASSROOM BLACK
% ENGLISH 1ST LANG.
% PRESCHOOL (OR NO. MOS.)
% PARENTS W/H.S. DIPL.
% PARENTS SKILLED OCCUP.
% PARENTS BLACK
% PARENTS POVERTY ELIGIBLE
% HEAD HOUSEHOLD EMPLOYED
% HEAD HOUSEHOLD MALE

Columns: S.D.; regression coefficients (RAW, STD, S.E.) for ACHIEVEMENT, WRAT, AFFECT, and ABSENCE.

Summary statistics: MEAN, VARIANCE, MULTIPLE R, VARIANCE WITH COVARIATES ELIMINATED.

From Emrick et al. (1973).
Table 6

Descriptive Characteristics of the National Population,
the Follow Through Sample, and the Non-Follow Through Sample

                 National    FT             NFT

Median Income    $9590       $4450          $6060
% Minorities     13%         [illegible]    79%
% Preschool      9%          81%            57%
Fall WRAT        NA          29.7           29.4
Appendix A*

Given the need to provide information to decision makers, the essential problem becomes the development of an evaluation approach that will provide the most valid and comprehensive information possible. To this end, Follow Through evaluation planners (USOE, SRI) adopted a quasi-experimental design, selecting at each site a comparison group as similar to the treatment group as possible. Since this design does not suggest a single "appropriate" analysis, we have subjected the data to a variety of "approximately appropriate" analytic procedures, so as not to be overly confined by the drawbacks of the design. The multiple strategies approach anticipated the common and valuable practice of performing secondary analyses such as those performed on the Equality of Educational Opportunity data. Any single analytic treatment of quasi-experimental data is inevitably subject to well-founded methodological criticism, especially when the data are being used to assess the impact of major educational programs. Subsequent reanalyses using other techniques and approaches help to assess the validity of the original results. Usually, after several reanalyses have been accomplished and a body of literature accumulated, all available information is integrated to refine and clarify understanding of the problem (or program). Our analytic cross-validation anticipates some of the more obvious reanalyses and should provide other researchers with a broader basis for designing further thoughtful approaches to the Follow Through data.

*This appendix is quoted from Stebbins (1976, p. A-46).
Non-Cognitive Outcomes and Classroom Environment*

Although the RFP calls for consideration of the "nonachievement factors which contribute to classroom environment," it does not spell out what these factors might be. It was suggested that the designer review this area and propose what definitions and instrumentation, if any, should be included in the Individualized Instruction Study. Our approach to this task has been two-pronged: (1) to determine whether non-cognitive student outcomes can and should be measured, and (2) to determine whether it is possible and desirable to assess the effect of programs on the total classroom environment.

We do not recommend that non-cognitive student outcomes be assessed in this study for two reasons. First, although schooling, individualized or not, may indeed have an effect on some non-cognitive outcomes, the theoretical basis for such a belief is not well developed. Without a sound basis, it is futile to attempt to measure non-cognitive or social outcomes since it is not clear what to measure or how to make causal arguments if effects are found. A second argument against the test of social outcomes is that their measurement in the primary grades is still in a primitive state.

Our consideration of non-cognitive or social outcomes began with the generation of a list of outcomes that designers of instructional programs have claimed will be affected by their programs (e.g., self-concept, inquiry skills, autonomy). The next step was to locate instruments that purport to measure these specific outcomes. The short duration of the study ruled out the possibility of developing such instruments from scratch. Existing instruments were located, screened, and eliminated from further consideration if they failed to meet any one of the following criteria:



*This appendix is quoted from Cooley and Leinhardt (1975b).
1. The instrument could not be highly correlated with reading and mathematics ability. If it were, it would measure little not already measured by the achievement test battery.

2. The instrument had to measure the social variables in question, i.e., it had to be valid as measured by standard measures of validity.

3. The instrument had to be reliable as measured by standard measures of reliability.

4. The instrument must have been designed or adapted for use in the primary grades.

5. The instrument must be usable from an administrative standpoint. This criterion would rule out instruments that are described in the literature but are otherwise untraceable, those that require an exorbitant amount of pupil/examiner time (in excess of three hours per pupil), and those that require a highly trained examiner or coder. A number of projective tests like doll-play were eliminated under this criterion.

The results of the search for an instrument that would meet these criteria were disappointing. Not one instrument of the many considered was totally acceptable. Table 3.1 lists some of the tests that were rejected and a criterion they failed. They may have failed other criteria, but this information was not recorded because the test reviewers eliminated an instrument upon failure to meet one criterion.



